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Abstract. Wo analyze the conjugate gradient (CG) method with variable preconditioning for 
solving a linear system with a real symmetric positive definite (SPD) matrix of coefficients A. We 
assume that the preconditioner is SPD on each step, and that the condition number of the precon- 
ditioned system matrix is bounded above by a constant independent of the step number. We show 
that the CG method with variable preconditioning under this assumption may not give improvement, 
compared to the steepest descent (SD) method. We describe the basic theory of CG methods with 
variable preconditioning with the emphasis on "worst case" scenarios, and provide complete proofs of 
all facts not available in the literature. We give a new elegant geometric proof of the SD convergence 
rate bound. Our numerical experiments, comparing the preconditioned SD and CG methods, not 
only support and illustrate our theoretical findings, but also reveal two surprising and potentially 
practically important effects. First, we analyze variable preconditioning in the form of inner-outer 
iterations. In previous such tests, the unpreconditioned CG inner iterations are applied to an artifi- 
cial system with some fixed preconditioner as a matrix of coefficients. We test a different scenario, 
where the unpreconditioned CG inner iterations solve linear systems with the original system matrix 
A. We demonstrate that the CG-SD inner-outer iterations perform as well as the CG-CG inner-outer 
iterations in these tests. Second, we compare the CG methods using a two-grid preconditioning with 
fixed and randomly chosen coarse grids, and observe that the fixed preconditioner method is twice 
as slow as the method with random preconditioning. 
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1. Introduction. Preconditioning, a transformation, usually implicit, of the 
original linear system aiming at accelerating the convergence of the approximations to 
the solution, is typically a necessary part of an efficient iterative technique. Modern 
preconditioning, e.g., based on so-called algebraic multilevel and domain decomposi- 
tion methods, attempts to become as close to a "black box" ideal of direct solvers as 
possible. In this attempt, the mathematical structure of the preconditioner, which in 
the classical case is regarded as some linear transformation, may become very com- 
plex, in particular, the linearity can be easily lost, e.g., if the preconditioning itself 
involves "inner" iterative solvers. The fact that the preconditioner may be nonlin- 
ear, or variable, i.e., changing from iteration to iteration, may drastically affect the 
known theory as well as the practical behavior of preconditioned iterative methods 
and therefore needs special attention. Our main result is that the conjugate gradi- 
ent (CG) method with variable preconditioning in certain situations may not give 
improvement, compared to the steepest descent (SD) method for solving a linear sys- 
tem with a real symmetric positive definite (SPD) matrix of coefficients. We assume 
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that the preconditioncr is SPD on each step, and that the condition number of the 
preconditioned system matrix is bounded above by a constant. 

Let us now introduce the notation, so that we can formulate the main result 
mathematically. Let A be a real SPD matrix, (x, y) be the standard inner product 
of real vectors x and y, so that (Ax,y) = (x,Ay), and let ||x|| = yj (x, x) be the 
corresponding vector norm. We also use || • || to denote the operator norm. The A- 
inner product and the A-norm are denoted by (x, y)A = {x, Ay) and \\x\\a = y {x, x)a- 

We consider a family of iterative methods to obtain a sequence of approximate 
solutions Xk of a linear system Ax — b and use the A-norm to measure the error 
e k = x — Xk- The SD and CG methods arc well-known iterative procedures that fit 
into our framework. To accelerate the convergence of the error e k to zero we introduce 
preconditioning, i.e., on every iteration fc an operator B k , called the preconditioner, 
possibly different for each iteration fc, is applied to the residual r k = b—Ax k . A general 
algorithm, which includes the preconditioned SD or CG (PSD or PCG respectively) 
methods as particular cases, can be presented as follows, e.g., Axelsson [3, p. 540] 
and Axelsson and Vassilevski [1, Algorithm 5.3]: given A, b, {B k }, {wfe}, xo, for 
fc = 0, 1, . . .: r-fc = b - Axk,sk = B% r k , and 

{As k ,pi) (r k ,Pk) n 

Pk = S k ~ > y-j rPl, X k+ i=X k + — -p k , (1.1) 

,.ri [Apupi) (Ap k ,p k ) 

where 

< m k < fc and m k +i < m k + 1. (1-2) 

The latter condition is highlighted in Notay [11, p. 1447, line 1] and ensures that the 
formula for p k in (1.1) performs the standard Gram-Schmidt A-orthogonalizations to 
previous search directions, which arc already pairwise A-orthogonal. The full orthogo- 
nalization that performs explicit A-orthogonalizations to all previous search directions 
corresponds to m k = fc. Choosing m k = min{fc,l} gives the PCG method, e.g., de- 
scribed in Golub and Ye [6, IPCG Algorithm]. The connection of this PCG method to 
the commonly used PCG algorithm is discussed in section 7 following Golub and Ye 
[6, Remark 2.3]. The shortest recursion m k = leads to the standard PSD method. 

It is well-known, e.g., D'yakonov [4, p. 34] and Axelsson [3, Sec. 11.1.2, p. 458] 
that if the preconditioner is SPD and fixed, B k = B = B* > 0, a preconditioned 
method, such as (1.1), using the preconditioner B can be viewed as the corresponding 
unpreconditioned method applied to the preconditioned system B~ 1 Ax = B~ 1 b in the 
B-based inner product (x,y) B = (x,By). This implies that the theory obtained for 
unpreconditioned methods remains valid for preconditioned methods, in particular, 
the A-orthogonalization terms with I < fc — 1 in the sum in (1.1) vanish in exact arith- 
metic, e.g., Axelsson [3, Sec. 11.2.6, Th. 11.5]. The situation changes dramatically, 
however, if different preconditioners B k are used in the PCG method. 

The present paper concerns the behavior of method (1.1), where the precondi- 
tioner B k varies from step to step, but remains SPD on each step and the spectral 
condition number k (B^ 1 A) = A max (B^ 1 A) /A m i n (B^ 1 A) is bounded above by some 
constant K max independent of the step number fc. We note that the matrix B^ 1 A is 
SPD with respect to, e.g., the B k inner product, so its eigenvalues are real positive. 
Let us highlight that our assumption k (B^ 1 A) < K max can be equivalently written as 
|| I — B^ l A\\s k < 7 with K max = (1 + 7)/(l — 7), assuming without loss of generality 
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that -Bfc is scaled such that A max (B^A) + A m ; n (B^A) = 2. Here, we only deal with 
methods that are invariant with respect to scaling of B^. 

The main result of this paper is that the preconditioned method (1.1) with (1.2) 
turns into the PSD method with the worst possible convergence rate on every iteration, 
if the preconditioncrs B^ satisfying our assumption k (B^ 1 A) < /c max arc chosen in 
a special way. We explicitly construct a variable prcconditioncr that slows down the 
CG method to the point that the worst linear convergence rate of the SD method is 
recovered. Thus one can only guarantee that the convergence rate for the method (1.1) 
with (1.2) is just the same as for the PSD method, = 0, obtained in Kantorovic 
[7] and reproduced, e.g., in Kantorovich and Akilov [8, Ch. XV]: 

ll e fc+l \\a < K max ~ 1 

||efe|U ~ K max + 1 

Our proof is geometric and is based on the simple fact, proved in section 2, that a 
nonzero vector multiplied by all SPD matrices with a condition number bounded by 
a constant generates a pointed circular cone. We apply this fact on every iteration to 
the current residual vector, which becomes the center of the cone, so all points in the 
cone correspond to all possible preconditioned residuals. In a somewhat similar way, 
Golub and Ye [6] use the angle between the exact and the perturbed preconditioned 
residuals. In the CG method context, this cone has a nontrivial intersection with the 
subspace A-orthogonal to all previous search directions. So on each iteration we can 
choose a preconditioncr with the a priori chosen quality, determined by /t max , that 
makes enforcing A-orthogonality with respect to all previous search directions useless. 

Basic properties of method (1.1), most importantly the local optimality, are de- 
rived in section 3. In section 4 we apply our results from section 2 about the cone 
to obtain a new proof of estimate (1.3). In section 5 we analyze the convergence 
of the PCG method with variable preconditioning and prove our main result. We 
assume real arithmetic everywhere in the paper, except for section 6, where we show 
that our main results also hold for complex Hcrmitian positive definite matrices. In 
section 7 we consider two particular PCG algorithms that arc often used in practice 
and describe their behavior with variable preconditioning. 

Our numerical experiments in section 8 comparing the preconditioned SD and 
CG methods support and illustrate our theoretical findings, and also reveal some 
potentially practically important effects. In subsection 8.1, we test the widely used 
modification of the CG method with a simplified formula for the scalar pj~ from 
section 7 and demonstrate that variable preconditioning can make this modification 
much slower than even the SD method. In subsection 8.2, we analyze inner-outer 
iterations as variable preconditioning. Finally, in subsection 8.3, we demonstrate that 
variable preconditioning may surprisingly accelerate the SD and the CG compared to 
the use of fixed preconditioning in the same methods. 

Different aspects of variable preconditioning arc considered, e.g., in Axelsson and 
Vassilevski [1, 2], Axelsson [3], where rather general nonlinear preconditioning is in- 
troduced, and in Golub and Ye [6], Notay [11] that mainly deal with the case when 
the preconditioner on each iteration approximates a fixed operator. In Axelsson and 
Vassilevski [2], Axelsson [3], Golub and Ye [6], Notay [11], convergence estimates for 
some iterative methods with variable preconditioning are proved. For recent results 
and other aspects of variable preconditioning sec Simoncini and Szyld [12, 13, 14] and 
references there. No attempts are apparently made in the literature to obtain a result 
similar to ours, even though it should appear quite natural and somewhat expected 
to experts in the area, after reading this paper. 
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2. Pointed circular cones represent sets of SPD matrices with varying 
condition numbers. For a pair of real non-zero vectors x and y we define the angle 
between x and y in the usual way as 

Z(x,y) = arccos ( ,, ) G [0,tt]. 

VIfII IMI/ 

The following theorem is inspired by Neymeyr [10, Lemma 2.3]. 

Theorem 2.1. The set {Cx}, where x is a fixed nonzero real vector and C runs 
through all SPD matrices with condition number k(C) bounded above by some K max > 
is a pointed circular cone, specifically, 

{Cx: C = C*> 0, k(C) < Kmax} = \y : sin Z {x, y) < Kmax ~ ) 

1 ^max T" ^ 



Theorem 2.1 can be proved by constructing our cone as the smallest pointed cone 
that includes the ball considered in Neymeyr [10, Lemma 2.3]. Preparing for section 
6 that deals with the complex case, not covered in Neymeyr [10], we provide a direct 
proof here based on the following two lemmas. The first lemma is simple and states 
that the set in question cannot be larger than the cone: 

Lemma 2.2. Let x be a non-zero real vector, let C be an SPD matrix with spectral 
condition number k (C). Then sinZ (x, Cx) < (k (C) — l)/(re (C) + 1). 

Proof. Denote y = Cx. We have (x, Cx) = (y,C~ 1 y) > since C is SPD, so 
y 7^ and Z (x, y) < ir/2. A positive scaling of C and thus of y is obviously irrelevant, 
so let us choose y to be the orthogonal projection of x onto the 1-dimensional sub- 
space spanned by the original y. Then from elementary 2D geometry it follows that 
\\y — x\\ = \\x\\ sinZ (x, y). The orthogonal projection of a vector onto a subspace is 
the best approximation to the vector from the subspace, thus 

\\x\\smZ(x,y) = \\y - x\\ < \\sy - x\\ = \\sCx - x\\ < \\sC - I\\ \\x\\ 

for any scalar s, where I is the identity. Taking s = 2/ (A max (C) + A m i n (C)) , where 
Amin (C) and A max (C) are the minimal and maximal eigenvalues of C, respectively, 
we get ||sC-7|| = {k (C) - 1)/(k(C) + 1). □ 

The second lemma implies that every point in the cone can be represented as Cx 
for some SPD matrix C with k(C) determined by the opening angle of the cone. 

Lemma 2.3. Let x and y be non-zero real vectors, such that Z(x,y) G [0, JJ. 
Then there exists an SPD matrix C , such that Cx = y and 

k(C)-1 . 

^c)TT = sinZ(x ' 2/) - 



Proof. Denote a = Z{x,y). A positive scaling of vector y is irrelevant, so as 
in the previous proof we choose y to be the orthogonal projection of x onto the 1- 
dimensional subspace spanned by the original y, then \\y — x\\ = (sin a) ||x||, so the 
vectors y — x and (sin a)x are of the same length. This implies that there exists a 
Householder reflection H such that H ((sina)x) = y — x, cf. Neymeyr [10, Lemma 
2.3], so (L + (sin a) if) x = y. We define C = L + (sin a) H to get Cx = y. Any 
Householder reflection is symmetric and has only two distinct eigenvalues ±1, so C is 
also symmetric and has only two distinct positive eigenvalues lisina, as a G [0, 7r/2), 
and we conclude that C > and k (C) = (1 + sina)/(l — sin a). □ 
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3. Local optimality of the method with variable preconditioning. Here 
we discuss some basic properties of method (1.1) with (1.2). We derive a simple, but 
very useful, error propagation identity in Lemma 3.1. We prove in Lemma 3.2 that the 
method is well-dchncd and has a certain local A-orthogonality property, formulated 
without a proof in Notay [11, formulas (2.1)-(2.2)] and in the important particular 
case m k = min{fc, 1} proved in Golub and Ye [6, Lemma 2.1]. Using the local A- 
orthogonality property of Lemma 3.2, we prove the local A-optimality property in 
Lemma 3.3 by generalizing the result of Golub and Ye [6, Proposition 2.2]. Finally, 
we derive a trivial Corollary 3.4 from Lemma 3.3, which uses the idea from Golub 
and Ye [6, p. 1309] of comparison with the PSD method, m k = 0. 

The material of this section is inspired by Golub and Ye [6] and may be known 
to experts in the field, e.g., some even more general facts can be found in Axelsson 
[3, Sec. 12.3.2, Lemma 12.22]. We provide straightforward and complete proofs here 
suitable for a general audience. 

Lemma 3.1. Let A and {B k } be SPD matrices. Suppose pk in method (1.1) is 
well-defined and nonzero. Then 

(Ae k ,p k ) 

e k+1 = e k - — -p k . (3.1) 

(A Pk ,p k ) 

Proof. Recall that e k = A~ x b — x k and thus r k = Ae k . Then (3.1) follows 
immediately from the last formula in (1.1). □ 

Lemma 3.2. Let A and {B k } be SPD matrices and {m k } satisfies (1.2). Then 
the error, the preconditioned residual, and the direction vectors generated by method 
(1.1) before the exact solution is obtained are well-defined and satisfy 

{Pi,Pj) A = °> k - m k <i < j <k, (3.2) 
(efc+i, s k ) A = (ek+i,Pi) A = 0, k - m k < i < k. (3.3) 



Proof. We first notice that (3.1) for any k obviously implies 

(e k+1 ,p k ) A =0. (3.4) 

For the rest of the proof we use an induction in k. Let us take k = and suppose 
xq ^ x, then ro ^ and so ^ since Bq is SPD. By (1.2), mo = and thus 
Po = So 7^ 0, so in the formula for x k+ i we do not divide by zero, i.e., x k+ i is well 
defined. There is nothing to prove in (3.2) for k — since mo = 0. Formula (3.4) 
implies (ei,po),4 = ( e l> s o)^i = 0, i.e., (3.3) holds for k — 0. This provides the basis 
for the induction. 

Suppose the statement of the lemma holds for k — 1, which is the induction 
hypothesis, i.e., up to the index k — 1 all quantities are well defined and 

{PnPj) A = °> k-l - mfe_i < i < j < k - 1, (3.5) 

(e k ,s k -i) A = (e k ,pi) A = 0, k- 1- m fc _i <i<k — l. (3.6) 

We now show by contradiction that x k ^ x implies p k ^ 0. Indeed, if p k = 
then s k is a linear combination of p k - mkl ■ ■ ■ ,Pk—i- However, since m k < m k -\ + 1, 
it follows from (3.6) that 

{e-k,Pi) A = 0, k — m k <i<k- 1. (3.7) 
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Then we have (s k ,e k ) A = 0. At the same time, since the matrix B k X A is A-SPD, 
s k = B^ 1 Ae k cannot be A-orthogonal to unless Sk = e k = 0, i.e., x k — x. 

Next, we prove (3.2) by showing that the formula for p k in (1.1) is a valid step 
of the Gram-Schmidt orthogonalization process with respect to the ^4-based inner 
product. If rrik = 0, there is nothing to prove. If m k = 1 then (3.2) gets reduced 
to {pk,Pk-i) a = 0' which follows from the formula for pk in (1.1). If > 2 
then condition (1.2) implies that vectors pk- mk , ■ ■ ■ ,Pk-i are among the vectors 
Pfe-i— m,k-i i ■ ■ ■ iPk-i arid therefore arc already A-orthogonal by the induction assump- 
tion (3.5). Then the formula for pk in (1.1) is indeed a valid step of the Gram-Schmidt 
orthogonalization process with respect to the ^4-based inner product, so (3.2) holds. 

It remains to prove (3.3). We have already established (3.2), and (3.4)-(3.7). 
Equalities (3.2) and (3.7) imply that pk and e k are A-orthogonal to pk- mk , ■ ■ ■ ,Pk—i- 
Equality (3.1) implies that ek+i is a linear combination of and p k . Thus, we have 
( e k+i,Pi) a = 0, fc — m k < i < k— 1. Finally, it is enough to notice that Sk is a linear 
combination o{p k ,p k _ x , . . . ,p k - mk , so (e k+1 ,s k ) A = 0. □ 

We now use Lemma 3.2 to prove the local optimality of method (1.1) with (1.2), 
which generalizes the statement of Golub and Ye [6, Proposition 2.2]. 

Lemma 3.3. Under the assumptions of Lemma 3.2, 

l|efc+ilU = . min . hk~p\\ A - 

Proof. We get e k +i G e k + span{s fe 

i Pk—nik •)••••> Pk — 

i} from the formula for p k 

in (1.1) and (3.1). Putting this together with A-orthogonality relations (3.3) of the 
vector ek+i with all vectors that span the subspace finishes the proof. □ 

Two important corollaries follow immediately from Lemma 3.3 by analogy with 
Golub and Ye [6, Proposition 2.2]. 

Corollary 3.4. The A-norm of the error Hefc+iH^ in method (1.1) with (1.2) 
is bounded above by the A-norm of the error of one step of the PSD method, rrik = 0, 
using the same Xk as the initial guess and Bk as the preconditioner, i.e., specifically, 
||e/c+ilL < min Q \\e k - as k \\ A - 

Corollary 3.5. Let m k > 0, then the A-norm of the error ||efc+i||^ in method 
(1.1) with (1.2) for k > satisfies Hek+iH^ < min Qi/3 \\e k - as k - fi{e k ~ e k -i)\\ A . 

Proof. Under the lemma assumptions, the formula for p k in (1.1) and (3.1) imply 
that ek+i G efc + span{s / t,pfe_ 1 } = efc + span{sfc, ek — ek~i} , and the ^.-orthogonality 
relations (3.3) turn into (e k +i,s k )A = and (e k +i,Pk-i)A = (e k+ i,e k - ek-i)A = 0, 
so the vector e k +i is A-orthogonal to both vectors that span the subspace. As in the 
proof of Lemma 3.3, the local A-orthogonality implies the local A-optimality. □ 

Corollary 3.4 allows us in section 4 to estimate the convergence rate of method 
(1.1) with (1.2) by comparison with the PSD method, mk = 0, — this idea is borrowed 
from Golub and Ye [6, p. 1309]. The results of Lemma 3.3 and Corollary 3.5 seem 
to indicate that an improved convergence rate bound of method (1.1) with (1.2) can 
be obtained, compared to the PSD method convergence rate bound that follows from 
Corollary 3.4. Our original intent has been to combine Corollary 3.5 with convergence 
rate bounds of the heavy ball method, in order to attempt to prove such an improved 
convergence rate bound. However, our results of section 5 demonstrate that this 
improvement is impossible under our only assumption n {B^ 1 A) < K max , since one 
can construct such preconditioncrs Bk that, e.g., the minimizing value of (3 in Corollary 
3.5 is zero, so Corollary 3.5 gives no improvement compared to Corollary 3.4. 
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4. Convergence rate bounds for variable preconditioning. The classical 
Kantorovich and Akilov [8, Ch. XV] convergence rate bound (1.3) for the PSD method 
is "local" in the sense that it relates the A-norm of the error on two subsequent 
iterations and does not depend on previous iterations. Thus, it remains valid when 
the preconditioner B k changes from iteration to iteration, while the condition number 
k (-BjT 1 A) is bounded above by some constant K max independent of k. The goal of 
this section is to give an apparently new simple proof of the estimate (1.3) for the 
PSD method, based on our cone Theorem 2.1, and to extend this statement to cover 
the general method (1.1) with (1.2), using Corollary 3.4. 

We denote the angle between two real nonzero vectors with respect to the A-based 
inner product by 

L A (x, y) = arccos ( ^ ,V }\ j G [0, tt] 

and express the error reduction ratio for the PSD method in terms of the angle with 
respect to the A-based inner product: 

Lemma 4.1. On every step of the PSD algorithm, (1.1) with m k = 0, the error 
reduction factor takes the form ||efc_|_i||^/||efe||^ = sin(Z^(efc, B^ 1 Ae k )). 

Proof. By (3.3), we have (e k+ i,p k ) A = . Now, for m k = 0, in addition, 
Pk = Sfe, so = (e k +i,Pk)A = (e k+1 ,s k ) A = (e k+1 ,x k+ i -x k ) A , i.e., the triangle with 
vertices x, x k , x k +i is right-angled in the A-inner product, where the hypotenuse is 
e k = x-x k . Therefore, ||e fe+ i|U/ Ue*]^ = sm(Z A (e k ,x k+ i -x k )) = sm(Z A (e kl s k )), 
where s k = B^ 1 {b - Ax k ) = B^Ae k by (1.1). □ 

Let us highlight that Lemma 4.1 provides an exact expression for the error reduc- 
tion factor, not just a bound — we need this in the proof of Theorem 5.1 in the next 
section. Combining the results of Lemmas 2.2 and 4.1 together immediately leads to 
(1.3) for the PSD method, where m k = 0. Finally, taking into account Corollary 3.4, 
by analogy with the arguments of Golub and Ye [6, p. 1309] and decrypting a hidden 
statement in Golub and Ye [6, Lemma 3.5], we get 

Theorem 4.2. Convergence rate bound (1.3) holds for method (1.1) with (1.2). 

5. The convergence rate bound is sharp. Here we formulate and prove the 
main result of the paper that one can only guarantee the convergence rate described 
by (1.3) for method (1.1) with (1.2) with variable preconditioning if one only assumes 
K (B^ l A) < K max . Let us remind the reader that (1.3) also describes the convergence 
rate for the PSD method, (1.1) with m k = 0. We now show that adding more vectors 
to the PSD iterative recurrence results in no improvement in convergence, if a specially 
constructed set of variable preconditioncrs is used. 

Theorem 5.1. Let an SPD matrix A, vectors b and xq, and K max > 1 be given. 
Assuming that the matrix size is larger than the number of iterations, one can choose 
a sequence of SPD preconditioner s B k , satisfying k(B'^ 1 A) < K max > such that method 
(1.1) with (1.2) turns into the PSD method, (1.1) with m k = 0, and on every iteration 

ll e fc+llU _ K max - 1 ^ 
||efe||yi Kmax + 1 

Proof. We construct the sequence B k by induction. First, we choose any vector 
<?o, such that sin Z A (qo, eo) = (rt max — l)/(K max + !)• According to Lemma 2.3 applied 
in the A-inner product, there exists an A-SPD matrix Cq with condition number 
k(Cq) = Kmaxj such that CqCq = qo- We define the SPD Bq = ACq 1 , then k(Bq 1 A) = 
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k(C ) 



= K 



max 



. We have Sk = B 



Ae;., so such a choice of B implies s = q a . Also, 



we have po — s , i.e., the first step is always a PSD step, thus, by Lemma 4.1 we have 
proved (5.1) for k = 0. Note that (ei,po)A = by (3.3). 

Second, we make the induction assumption: let preconditioners Bi for I < k — 1 
be constructed, such that ||ej + i|| A /ll e /II.A = ( K m ax - l)/(K max + 1) and (e kl pi) A = 
hold for all Z < k — 1 . The dimension of the space is greater than the total number of 
iterations by our assumption, so there exists a vector Uk, such that {uk,pi)A = for 
I < k — 1 and Uk and ek are linearly independent. Then the 2D subspace spanned by 
Uk and e k is A-orthogonal to pi for I < k — 1. 

Let us consider the boundary of the pointed circular cone made of vectors qk 
satisfying the condition sin Z^q^, ek) — (K max — l)/(K max + 1). This conical surface 
has a nontrivial intersection with the 2D subspace spanned by and e^, since is 
the cone axis. Let us choose vector in the intersection, This vector will be obviously 
A-orthogonal to pi, I < k — 1. 

Applying the same reasoning as for constructing Bq, wc deduce that there exists 
an SPD Bk such that n(B^ 1 A) < K max and B^ 1 Aek = qk- With such a choice of Bk 
we have Sk = qk- Since qk = Sk is ^4-orthogonal to pi for alH < k— 1, it turns out that 
Pfc = Sfc, no matter how {mfe} are chosen. This means that Xk+i is obtained from Xk 
by a steepest descent step. Then we apply Lemma 4.1 and conclude that (5.1) holds. 
We note, that (ek+\,pi)A = for all I < k. Indeed, (ek+i 7 pi)A = for all I < k — 1 
since e&+i is a linear combination of and pk = Sk = Qk, both A-orthogonal to pi 
for I < k — 1. Finally, (efc + i,pfc)A = by (3.3). This completes the construction of 
{Bk} by induction and thus the proof. □ 

Let us highlight that the statement of Theorem 5.1 consists of two parts: first, 
it is possible to have the PCG method with variable preconditioning that converges 
not any faster than the PSD method with the same preconditioning; and second, 
moreover, it is possible that the PCG method with variable preconditioning converges 
not any faster than the worst possible theoretical convergence rate for the PSD method 
described by (1.3). Numerical tests in section 8 show that the former possibility is 
more likely than the latter. Specifically, we demonstrate numerically in subsection 
8.3 that the PCG and PSD methods with random preconditioning converge with the 
same speed, but both are much faster than what bound (1.3) predicts. 

6. Complex Hermitian case. In all other sections of this paper we assume for 
simplicity that matrices and vectors are real. However, our main results also hold 
when matrices A and {Bk} are complex Hermitian positive definite. In this section 
we discuss necessary modifications to statements and proofs in sections 2, 4 and 5 in 
order to cover the complex Hermitian case. 

In section 2, the first thing to be changed is the definition of the angle between 
two non-zero vectors i,i/£ C™, where an absolute value is now taken, 



that makes the angle acute and invariant with respect to complex nonzero scaling of 
the vectors. Lemma 2.2 remains valid in the complex case: 

Lemma 6.1. Let x be a non-zero complex vector, and C be a complex Hermitian 
positive definite matrix with the spectral condition number k (C), then sinZ (x, Cx) < 



(«(C)-1)/(«(C)+1). 

Proof. Denote y — Cx and let 7 — (?/, x) / \\y\\ then is the projection of x onto 
span{y}, and Z(x,y) = Z(x,jy). Moreover, (x,jy) = {x,y)^f = {x 1 y){y,x)/\\y\\ 2 is 



Z(x, y) = arccos 




IMI IMI 
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real — we need this fact later in the proof of Lemma 6.2. We redefine y to jy. The 
rest of proof is exactly the same as that of Lemma 2.2, since the identity \\y — x\\ = 
\\x\\ sin Z (x, y) , where y is scaled by a complex scalar to be the orthogonal projection 
of x onto span{y}, holds in the complex case with the new definition of the angle. □ 

Lemma 2.3 and, thus, Theorem 2.1 do not hold in the complex case after the 
straightforward reformulation. A trivial counterexample is a pair of vectors x =/= 
and y = ix — the angle between x and y is obviously zero, yet it is impossible that 
y = Cx for any complex Hermitian matrix C, since the inner product (x, y) = — z||x|| 2 
is not a real number. This counterexample also gives an idea for a simple fix: 

Lemma 6.2. Let x and y be non-zero complex vectors, such that Z (x,y) ^ ir/2. 
Then there exists a complex Hermitian positive definite matrix C and a complex scalar 
7, such that Cx = jy and (k (C) — 1)/(k (C) + 1) = sin Z (x, y) . 

Proof. We first scale the complex vector y as in the proof of Lemma 6.1 to make 
y to be the projection of x onto span{j/}. The rest of the proof is similar to that of 
Lemma 2.3, but we have to be careful working with the Householder reflection in the 
complex case, so we provide the complete proof. 

The redefined y is the projection of x onto spanjy}, thus, \\y — x\\ — (sin a) ||x||, 
so the vectors u — y — x and v = (sina)a; are of the same length. Moreover, their inner 
product (u, v) is real, since (x, y) is real, see the proof of Lemma 6.1. This implies 
that the Householder reflection Hz = z — 2(w, z)w, where w = (u — v)/\\u — v\\, acts 
on z = u such that Hu = v, i.e., H ((sin a) x) = y — x, so (7 + (sin a) H) x = y. We 
define C = I + (sin a) H to get Cx = y. 

The Householder reflection H is Hermitian and has only two distinct eigenvalues 
±1, so C is also Hermitian and has only two distinct positive eigenvalues 1 ± sin a, as 
a E [0, it/2), and we conclude that C > and k (C) = (1 + sina)/(l — sin a). □ 

The same change then makes Theorem 2.1 work in the complex case: 

Theorem 6.3. The set {^/Cx}, where x is a fixed nonzero complex vector, 7 
runs through all nonzero complex scalars, and C runs through all complex Hermitian 
positive definite matrices with condition number k(C) bounded above by some K max , 
is a pointed circular cone, specifically, 

{ 7 Cx: 1 ^0,C = C* >0, k(C) </w}= ly: sin Z (x, y) < \ 1 . 

t K max T J- J 



Section 3 requires no changes other then replacing "SPD" with "Hermitian posi- 
tive definite." In section 4 we just change the definition of the A-angle to 



-A {x,y) 



(x,y), 



\ x \\a WvWa 



and then Lemma 4.1 holds without any further changes. 

Finally, the statement of Theorem 5.1 from section 5 allows for a straightforward 
generalization: 

Theorem 6.4. Let a Hermitian positive definite matrix A, complex vectors b and 
xo, and /c m ax > 1 be given. Assuming that the matrix size is larger than the number 
of iterations, one can choose a sequence of Hermitian positive definite preconditioners 
Bk, satisfying n(B^ 1 A) < K maX ; such that method (1.1) with (1.2) turns into the PSD 
method, (1.1) with = 0, and on every iteration 

ll e fc+l \\a _ K max ~ 1 ^ 
lle/JL K max + 1 
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Proof. Only a small change in the proof of Theorem 5.1 is needed. We first choose 
any vector q' , satisfying sinZ^go, eo) = (t m ax — l)/(K ma x + !)• Then by Lemma 6.2 
we obtain the complex Hcrmitian positive definite matrix Cq and the complex scalar 
7 such that Cgeo = "fq . Finally, we choose q$ to be jq' and continue as in the proof 
of Theorem 5.1. The same modification is made in the choice of the vectors qu for 
k > 1 later in the proof. □ 

7. Practical PCG algorithms. In this section we briefly discuss two particular 
well-known PCG algorithms that are often used in practice. Our discussion here is 
motivated by and follows Golub and Ye [6, Remark 2.3]. Suppose A, b, x$, r$ = 
b — Axo, {Bk} for k = 0, 1, . . . are given and consider Algorithm 7.1 where (3k on line 
7.1 is defined either by expression 



or by expression 



fa= , y (7.1) 



(s k ,r k - rfc-i) 

Pk = — f n-- 

(s fc _i,r fc _ij 



Formula (7.1) is more often used in practice compared to (7.2), since it can be imple- 
mented in such a way that does not require storing the extra vector r*fc_i. 

Algorithm 7.1 

for k = 0, 1, . . . do 

Sk = -Bfc Vfc 

if k = then 

Po = so 
else 

Pk = s k+PkPk-i (where Pk is defined by either (7.2) or (7.1) for all iterations) 
end if 

(sk,r k ) 

(p k ,Ap k ) 
Xk+i = x k + akPk 
rk+i =r k - akApk 
end for 



If the preconditioner is SPD and fixed, it is well-known, e.g., Golub and Ye [6, 
Remark 2.3], that (sk,rk-i) = 0, so formula (7.2) coincides with (7.1) and Algorithm 
7.1 is described by (1.1) with vtik = min (k, 1). Of course, in this case the choice nik = 
min (fc, 1) is enough to keep all search directions A-orthogonal in exact arithmetic. 

Things become different when variable preconditioning is used. It is well-known, 
e.g., Golub and Ye [6, Remark 2.3] and Notay [11, Table 2], that using formula (7.1) 
for (3k can significantly slow down the convergence, and we provide our own numerical 
evidence of that in section 8. At the same time, comparing Lemma 3.2 with Lemma 2.1 
from Golub and Ye [6], we can show, see Knyazev and Lashuk [9, vl], that Algorithm 
7.1 with (3k defined by (7.2), which is exactly Golub and Ye [6, IPCG Algorithm], is 
equivalent to the particular case of (1-1), namely with rrik = min (fc, 1), and therefore 
is guaranteed by Theorem 4.2 to converge with at least the same speed as the PSD 
method. 
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Iterative errors for PCG methods 
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Fig. 8.1. Algorithm 7.1 with (7.1) fails to provide the PSD convergence rate. 



8. Numerical experiments. Wc first illustrate the main theoretical results of 
the paper numerically for a model problem. Wc numerically investigate the influence 
of the choice for j3k between formulas (7.1) and (7.2) in Algorithm 7.1 and observe 
that (7.2) leads to the theoretically predicted convergence rate, while (7.1) may sig- 
nificantly slow down the convergence. Second, we test the convergence of inner-outer 
iteration schemes, where the inner iterations play the role of the variable precondition- 
ing in the outer PCG iteration, and we illustrate our main conclusion that variable 
preconditioning may effectively reduce the convergence speed of the PCG method to 
the speed of the PSD method. Third, and last, we test the PSD and PCG methods 
with prcconditioncrs of the same quality chosen randomly. We observe a surprising 
acceleration of the PCG method compared to the use of only one fixed prcconditioner; 
at the same time, we show that the PSD method with random prcconditioncrs works 
as well as the PCG method, which explains the PCG acceleration and again supports 
our main conclusion. 

8.1. Numerical illustration of the main results. Here, we use the standard 
3-point approximation of the 1-D Laplacian of the size 200 as the matrix A of the 
system. To simulate the application of the variable preconditioner, we essentially 
repeat the steps described in the proof of Theorem 5.1, i.e., we fix the condition 
number k (B^ 1 A) = 2 and on each iteration we generate a pseudo-random vector 
Sfe, which is A-orthogonal to previous search directions and such that the A-angle 
between Sk and ek satisfies sin(ZA (sfc,efc)) = (k — l)/(n+ 1). 

We summarize the numerical results of this subsection on Figure 8.1, where the 
horizontal axis represents the number of iterations and the vertical axis represents the 
yl-norm of the error. The iteration count actually starts from 1, so the A- norm of the 
error on the 0-th iteration ||eo is just the A-norm of the initial error. The straight 
dotted (red in the colored print) line marked with squares on Figure 8.1 represents 
the PSD theoretical bound (1.3) and at the same time it perfectly coincides, which 
illustrates the statements of Theorem 5.1, with the change of the ^4-norm of the error 
in the case where the complete A-orthogonalization is performed, i.e., nik = k in 
method (1.1), as well as in the case where Algorithm 7.1 with (3k defined by (7.2) is 
used. The curved solid (blue) line marked with diamonds represents the convergence 
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Fig. 8.2. The PSD and PCG methods with preconditioning by inner CG with 
different stopping criteria i] = 0.2,0.4,0.6 and 0.8 (from the bottom to the top). 



of Algorithm 7.1 with f3k defined by (7.1), which visibly performs much worse in this 
test compared to Algorithm 7.1 with (7.2). The paper Notay [11, Sec. 5.2] contains 
analogous results comparing the change in the convergence rate using formulas (7.1) 
and (7.2), but it misses a comparison with the PSD method. To check our results 
of section 6, we repeat the tests in the complex arithmetic. The figure generated is 
similar to Figure 8.1, so we do not reproduce it here. 

8.2. Inner-outer iterations as variable preconditioning. Inner-outer iter- 
ative schemes, where the inner iterations play the role of the variable preconditioncr 
in the outer PCG iteration is a traditional example of variable preconditioning; see, 
e.g., Golub and Ye [6], Notay [11]. Previously published tests analyze an approxi- 
mation of some fixed preconditioncr, Bk ~ B, different from A, by inner iterations, 
typically using the PCG method. The quality of the approximation is determined 
by the stopping criteria of the inner PCG method. A typical conclusion is that the 
performance of the outer PCG method improves and it starts behaving like the PCG 
method with the fixed preconditioner B when B^ approximates B more accurately 
by performing more inner iterations. 

The idea of our tests in this subsection is different: we approximate B^ ss B = A. 
The specific setup is the following. We take a diagonal matrix A with all integer en- 
tries from 1 to 2000, with the right-hand side zero and a random normally distributed 
zero mean initial guess, and we do the same for the PSD and PCG methods. For 
preconditioning on the fc-th step, applied to the residual r-fc, we run the standard CG 
method without preconditioning as inner iterations, using the zero initial approxima- 
tion, and for the stopping criteria we compute the norm of the true residual at every 
inner iteration and iterate until it gets smaller than for a given constant n. On 

Figure 8.2, we demonstrate the performance of the PSD and PCG methods for four 
values of 77 = 0.2,0.4,0.6 and 0.8 (from the bottom to the top). We observe that 
the PSD, displayed using dashed (red in the colored print) lines marked with circles 
and PCG shown as dash-dot (blue) lines with x-marks methods both converge with a 
similar rate, for each tested value of 77. We notice here that the PSD method is even 
a bit faster than the PCG method. This does not contradict our Corollary 3.4, since 
the preconditioners Bk here are evidently different in the PSD and PCG methods 
even though they are constructed using the same principle. 
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Fig. 8.3. Two- grid preconditioning with fixed (left) and random (left) coarse grids. 

8.3. Random vs. fixed preconditioning. In this subsection, we numerically 
investigate a situation where random preconditioners of a similar quality are used in 
the course of iterations. The system matrix is the standard 3-point finite-difference 
approximation of the one-dimensional Laplacian using 3000 uniform mesh points and 
the Dirichlct boundary conditions. We test the simplest multigrid preconditioning 
using two grids, where the number of coarse grid points is 600. The interpolation is 
linear, the restriction is the transpose of the interpolation, and the coarse-grid operator 
is defined by the Galerkin condition. The smoother is the Richardson iteration. On 
Figure 8.3 left, we once choose (pscudo-)randomly 600 coarse mesh points and build 
the fixed two-grid prcconditioncr, based on this choice. On Figure 8.3 right, we choose 
600 new random coarse mesh points and rebuild the two-grid prcconditioncr on each 
iteration. We note that in the algebraic multigrid the geometric information about 
the actual position of the coarse grid points is not available, so the random choice of 
the coarse grids may be an interesting alternative to traditional approaches. 

Figure 8.3 displays the convergence history for the PSD (top), PCG (middle), 
and PCG with the full orthogonalization (bottom) with the same random initial guess 
using the fixed (left) and variable (right) two-grid preconditioners. On Figure 8.3 left, 
for a fixed preconditioner, we observe the expected convergence behavior, with the 
PSD being noticeably the slowest, and the PCG with the full orthogonalization being 
slightly faster than the standard PCG. Figure 8.3 right demonstrates that all three 
methods with the variable random prcconditioncr converge with essentially the same 
rate, which again illustrates the main result of the paper that the PCG method with 
variable preconditioning may just converge with the same speed as the PSD method. 

Figure 8.3 reveals a surprising fact that the methods with random preconditioning 
converge twice as fast as the methods with fixed preconditioning! We highlight that 
Figure 8.3 shows a typical case, not a random outlier, as we confirm by repeating the 
fixed preconditioner test in the left panel for every random prcconditioncr used in 
the right panel of Figure 8.3 and by running the tests multiple times with different 
seeds. Our informal explanation for the fast convergence of the PSD method with 
random preconditioning is based on Lemma 4.1 that provides the exact expression 
for the error reduction factor as sin(Z J 4(efe, B^ 1 Aek))- It takes its largest value only 
if e& is one of specific linear combination of the eigenvectors of B^ 1 Ae corresponding 
to the two extreme eigenvalues. If Bk is fixed, the error ek in the PSD method after 
several first iterations approaches these magic linear combinations, e.g., Forsythe [5], 
and the convergence rate reaches its upper bound. If Bk changes randomly, as in our 
test, the average "effective" angle is smaller, i.e., the convergence is faster. 
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Conclusions. We use geometric arguments to investigate the behavior of the 
PCG methods with variable preconditioning under a rather weak assumption that 
the quality of the preconditioncr is fixed. Our main result is negative in its nature: 
we show that under this assumption the PCG method with variable preconditioning 
may converge as slow as the PSD method, moreover, as the PSD method with the 
slowest rate guaranteed by the classical convergence rate bound. In particular, that 
gives the negative answer, under our assumption, to the question asked in Golub and 
Ye [6, Sec. 6, Conclusion.] whether better bounds for the steepest descent reduction 
factor may exists for Algorithm 7.1 with (7.2). 

Stronger assumptions on variable preconditioning, e.g., such as made in Golub 
and Ye [6], Notay [11] that the variable preconditioners are all small perturbations 
of some fixed preconditioncr, are necessary in order to hope to prove a convergence 
rate bound of the PCG method with variable preconditioning resembling the standard 
convergence rate bound of the PCG method with fixed preconditioning. Such stronger 
assumptions hold in many presently known real life applications of the PCG methods 
with variable preconditioning, but often require extra computational work, e.g., more 
inner iterations in the inner-outer iterative methods. 
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